5 research outputs found

Robust part-of-speech tagging of social media text

Author: Horsmann Tobias
Publication date: 27/04/2018

Part-of-speech (PoS) taggers are an important processing component in many Natural Language Processing (NLP) applications, which has led to a variety of taggers for this task. Recent work in this field has shown that tagging accuracy on informal text domains is poor in comparison to formal text domains. In particular, social media text, which is inherently different from formal standard text, leads to a drastically increased error rate. These challenges originate in a lack of robustness of taggers towards domain transfers. The increased error rate has an impact on NLP applications that depend on PoS information.
The main contribution of this thesis is the exploration of the concept of robustness under the following three aspects: (i) domain robustness, (ii) language robustness and (iii) long-tail robustness.
Regarding (i), we start with an analysis of the phenomena found in informal text that make tagging this kind of text challenging. Furthermore, we conduct a comprehensive robustness comparison of many commonly used taggers for English and German by evaluating them on text from several domains. We find that the tagging of informal text is poorly supported by available taggers. A review and analysis of currently used methods to adapt taggers to informal text shows that these methods improve tagging accuracy but offer no satisfactory solution. We propose an alternative tagging approach that reaches an increased multi-domain tagging robustness. This approach is based on tagging in two steps: the first step tags on a coarse-grained level and the second step refines the coarse tags to fine-grained tags.
Regarding (ii), we investigate whether each language requires a language-tailored PoS tagger or whether the construction of a competitive language-independent tagger is feasible. We explore the technical details that contribute to a tagger's language robustness by comparing taggers based on different algorithms on learning models for 21 languages. We find that language robustness is a less severe issue and that the impact of the tagger choice depends more on the granularity of the tagset to be learned than on the language.
Regarding (iii), we investigate methods to improve tagging of infrequent phenomena for which an insufficient amount of annotated training data is available, which is a common challenge in the social media domain. We propose a new method to overcome this lack of data that offers an inexpensive way of producing more training data and requires only minimal manual annotation effort. In a field study, we show that the quality of the produced data suffices to train tagger models that can recognize these under-represented phenomena.
Furthermore, we present two software tools, FlexTag and DeepTC, which we developed in the course of this thesis. These tools provide the necessary flexibility for conducting all the experiments in this thesis and ensure their reproducibility.
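The two-step idea from the abstract (tag coarsely first, then refine to fine-grained tags) can be illustrated with a toy sketch. This is not the thesis's implementation: the tag mapping `FINE_TO_COARSE`, the tiny corpus `TRAIN`, and the frequency-lookup "taggers" are all assumptions chosen only to keep the example self-contained.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from fine-grained tags to coarse tags (illustration only).
FINE_TO_COARSE = {"NN": "NOUN", "NNS": "NOUN", "VB": "VERB",
                  "VBZ": "VERB", "DT": "DET", "UH": "INTJ"}

# Hypothetical toy training corpus of (word, fine tag) pairs.
TRAIN = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("dogs", "NNS"), ("bark", "VB")],
    [("lol", "UH"), ("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

def train(sentences):
    """Collect the frequency tables used by both tagging steps."""
    coarse_by_word = defaultdict(Counter)       # step 1: word -> coarse tag counts
    fine_by_word_coarse = defaultdict(Counter)  # step 2: (word, coarse) -> fine tag counts
    fine_by_coarse = defaultdict(Counter)       # step 2 fallback: coarse -> fine tag counts
    for sentence in sentences:
        for word, fine in sentence:
            coarse = FINE_TO_COARSE[fine]
            key = word.lower()
            coarse_by_word[key][coarse] += 1
            fine_by_word_coarse[(key, coarse)][fine] += 1
            fine_by_coarse[coarse][fine] += 1
    return coarse_by_word, fine_by_word_coarse, fine_by_coarse

def tag_coarse(words, model, default="NOUN"):
    """Step 1: assign the most frequent coarse tag; unknown words get `default`."""
    coarse_by_word = model[0]
    tags = []
    for word in words:
        counts = coarse_by_word.get(word.lower())
        tags.append(counts.most_common(1)[0][0] if counts else default)
    return tags

def refine(words, coarse_tags, model):
    """Step 2: pick a fine tag that is consistent with the coarse tag from step 1."""
    _, fine_by_word_coarse, fine_by_coarse = model
    tags = []
    for word, coarse in zip(words, coarse_tags):
        counts = fine_by_word_coarse.get((word.lower(), coarse)) or fine_by_coarse.get(coarse)
        # If nothing is known about this coarse tag, keep the coarse tag itself.
        tags.append(counts.most_common(1)[0][0] if counts else coarse)
    return tags

model = train(TRAIN)
words = ["lol", "the", "dog", "barks"]
coarse = tag_coarse(words, model)
fine = refine(words, coarse, model)
```

The point of the split is that the second step only has to choose among fine tags compatible with the coarse decision, so an unseen word still receives a plausible fine tag through the coarse-level fallback.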

Duisburg-Essen Publications Online

Reliable Part-of-Speech Tagging of Low-frequency Phenomena in the Social Media Domain

Author: Beißwenger Michael
Horsmann Tobias
Zesch Torsten
Publication venue: cmc-corpora conference series

We present a series of experiments to adapt a part-of-speech (PoS) tagger to extremely infrequent PoS tags for which only a limited amount of training data is available. The objective is to implement a tagger that tags this phenomenon with a high degree of correctness, so that it can be used as a corpus query tool that easily finds new instances of the phenomenon in plain-text corpora. We focused on avoiding manual annotation as much as possible and experimented with altering the frequency weight of the PoS tag of interest in our small training data set. This approach was compared to adding machine-tagged training data in which only the phenomenon of interest is manually corrected. We find that adding more training data is unavoidable, but that machine tagging the data and hand-correcting only the tag of interest is sufficient. Furthermore, the choice of tagger plays an important role, as some taggers are better equipped to deal with rare phenomena than others. The best trade-off between precision and recall for the phenomenon of interest was achieved by separating the tagging into two steps. An evaluation of this phenomenon-fitted tagger on social media plain text confirmed that the tagger serves as a useful corpus query tool that retrieves instances of the phenomenon, including many unseen ones.
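One simple way to read "altering the frequency weight of the PoS tag of interest" is to repeat training sentences that contain that tag. The sketch below assumes this oversampling interpretation; the paper's actual weighting scheme is not given in the abstract, and the tag name `RARE`, the function names, and the toy corpus are hypothetical.

```python
def upweight_tag(sentences, tag_of_interest, factor):
    """Return a training set in which every sentence containing
    `tag_of_interest` appears `factor` times (assumed oversampling scheme)."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    weighted = []
    for sentence in sentences:
        has_tag = any(tag == tag_of_interest for _, tag in sentence)
        weighted.extend([sentence] * (factor if has_tag else 1))
    return weighted

def count_tag(sentences, tag_of_interest):
    """Count how many tokens carry `tag_of_interest`."""
    return sum(tag == tag_of_interest for s in sentences for _, tag in s)

# Hypothetical toy corpus; "RARE" stands in for the low-frequency tag.
corpus = [
    [("nice", "ADJ"), ("day", "NOUN")],
    [("*grins*", "RARE")],
    [("see", "VERB"), ("you", "PRON")],
]
weighted = upweight_tag(corpus, "RARE", factor=3)
```

According to the abstract, reweighting existing data was not enough on its own; adding machine-tagged data with the tag of interest hand-corrected was still needed.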

ZENODO

Connecting Resources: Which Issues Have to be Solved to Integrate CMC Corpora from Heterogeneous Sources and for Different Languages?

Author: Beisswenger Michael
Etienne Carole
Fišer Darja
Grumt Suárez Holger
Herzberg Laura
Hinrichs Erhard
Ho-Dac Lydia-Mai
Horsmann Tobias
Karlova-Bourbonus Natali
Lemnitzer Lothar
Longhi Julien
Lüngen Harald
Parisse Christophe
Poudat Céline
Schmidt Thomas
Stemle Egon
Storrer Angelika
Wigham Ciara
Zesch Torsten
Publication venue: HAL CCSD
Publication date: 03/10/2017

The paper reports on the results of a scientific colloquium dedicated to the creation of standards and best practices which are needed to facilitate the integration of language resources for CMC stemming from different origins and the linguistic analysis of CMC phenomena in different languages and genres. The key issue to be solved is that of interoperability with respect to the structural representation of CMC genres, linguistic annotations metadata, and anonymization/pseudonymization schemas. The objective of the paper is to convince more projects to take part in a discussion about standards for CMC corpora and about the creation of a CMC corpus infrastructure across languages and genres. In view of the broad range of corpus projects currently underway all over Europe, there is a great window of opportunity for the creation of standards in a bottom-up approach.

Hal-Diderot

Connecting Resources: Which Issues Have to be Solved to Integrate CMC Corpora from Heterogeneous Sources and for Different Languages?

Author: Beisswenger Michael
Etienne Carole
Fišer Darja
Grumt Suárez Holger
Herzberg Laura
Hinrichs Erhard
Ho-Dac Lydia-Mai
Horsmann Tobias
Karlova-Bourbonus Natali
Lemnitzer Lothar
Longhi Julien
Lüngen Harald
Parisse Christophe
Poudat Céline
Schmidt Thomas, C.
Stemle Egon
Storrer Angelika
Wigham Ciara
Zesch Torsten
Publication venue: HAL CCSD
Publication date: 01/01/2017

HAL-ENS-LYON

Scientific Publications of the University of Toulouse II Le Mirail

HAL Clermont Université

MAnnheim DOCument Server

HAL

Publikationsserver des Instituts für Deutsche Sprache

Hal-Diderot

Natural Language Processing for Social Media, Second Edition

Author: Abdul-Mageed Muhammad
Akhtar Md Shad
Al-Gaphari Galeb H.
Ali Tanveer
Allan James
Artstein Ron
Arunachalam Ravi
Atefeh Farzindar
Avudaiappan Neela
Baccianella Stefano
Bakr Hitham Abo
Balasubramanyan Ramnath
Baldwin Timothy
Balikas Georgios
Barman Utsab
Baroni Marco
Becker Hila
Bellaachia Abdelghani
Benson Edward
Benton Adrian
Berger Adam L.
Bergsma Shane
Bermingham Adam
Beverungen Gary
Bing Li
Bizer Christian
Blei David M.
Bollen Johan
Bollen Jonah
Bontcheva Kalina
Boujelbane Rahma
Brantingham Richard
Brew Chris
Burfoot Clinton
Caragea Cornelia
Carletta Jean
Carter Simon
Celli Fabio
Chen Hailiang
Chen Zheng
Chilet Jorge Ale
Choudhury Munmun De
Colbaugh Richard
Coppersmith Glen
Cordeiro Mário
Cucerzan S.
Cunningham Hamish
Daumé Hal
Davidov Dmitry
Debnath Pragna
Delort Jean-Yves
Demir Seniz
Derczynski Leon
Diab Mona
Diana Inkpen
Dlugolinský Stefan
Dodds Peter Sheridan
Dredze Mark
Duan Yajuan
Dunning Ted
Eisenstein Jacob
Ekman Paul
Elfardy Heba
Farzindar Atefeh
Ferragina Paolo
Fokkens Antske
Ford Dominey Peter
Foster George
Foster Jennifer
Friedman Jerome H.
Gella Spandana
Ghazi Diman
Gil Gonzalo Blazquez
González-Ibáñez Roberto
Gotti Fabrizio
Guo Weiwei
Habash Nizar
Han Bo
Harabagiu Sanda
Harrison Phillip G.
He Hangfeng
Hecht Brent
Henrich Verena
Heravi Bahareh Rahmanzadeh
Hoffart Johannes
Holzman Lars E.
Horsmann Tobias
Howes Christine
Hsieh Wen-Tai
Hu Meishan
Huang Fei
Imran Muhammad
Inouye David
Izard Caroll E.
Jehl Laura
Jehl Laura Elisabeth
Jin Xiaotian
Judd Joel
Kashyap Ranjitha
Khabiri Elham
Khan Mohammad
Kim Sang Erik Tjong
Kokkos Athanasios
Lafferty John D.
Lampos Vasileios
Leonard
Lewis Will
Li Jiwei
Limsopatham Nut
Lin Hui
Ling Wang
Liu Bing
Liu Ji
Liu Wendy
Liu Xiaohua
Llewellyn Clare
Long Rui
Lui Marco
Lukin Stephanie
Lösch Uta
Ma Jing
Mao Huina
Marchetti-Bowick Micol
Marcus Mitchell P.
Margaret
Maynard Diana
Metzler Donald
Mishne Gilad
Moghaddam Samaneh
Mohammad Saif M.
Mohammady Ehsan
Mohay George
Moro Andrea
Mubarak Hamdy
Munro Robert
Neviarouskaya Alena
Nguyen Dong
Nikfarjam Azadeh
O'Connor Brendan
Oberlander Jon
Ovrelid Lilja
Owoputi Olutobi
Pajzs Julia
Pak Alexander
Paranjpe Deepa
Park Minsu
Paul Michael
Peng Fuchun
Peng Nanyun
Pennebaker James W.
Persing Isaac
Petrovic Sasha
Pla Ferran
Plutchik Robert
Poese Ingmar
Popescu Adrian
Popescu Ana-Maria
Porshnev Alexander
Power Robert
Prapula G.
Ramage Daniel
Rao Delip
Razmara Majid
Riloff Ellen
Ritter Alan
Roller Stephen
Rowe Matthew
Rubin Victoria
Sawaf Hassan
Schler Jonathan
Seddah Djamé
Shaalan Khaled
Shamma D. A.
Sharifi Beaux
Shickel Benjamin
Simsek M. U.
Sinha Priyanka
Sokolova Marina
Strapparava Carlo
Sul Hong Keel
Titov Ivan
Tkachenko Alexander
Tromp Erik
Uzuner Özlem
Vallor Shannon
Verma Sudha
Wan Stephen
Wang Na
Wang Pidong
Washington
Weerkamp Wouter
Weng Jianshu
William
Wing Benjamin
Witten Ian
Wu Wei
Xie Wei
Yan Rui
Yang Steve Y.
Zbib Rabih
Zesch Torsten
Zhao Wayne Xin
Zhou Liang
Zhou Ning
Publication venue: Morgan & Claypool Publishers LLC

Crossref